Learning multilayer perceptrons efficiently 
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A learning algorithm for multilayer perceptrons is pre- 
sented which is based on finding the principal components 
of a correlation matrix computed from the example inputs 
and their target outputs. For large networks our procedure 
needs far fewer examples to achieve good generalization than 
traditional on-line algorithms. 

PACS: 84.35. +i,89. 80. +h, 64.60. Cn 

Multilayer neural networks achieve the ability to ap- 
proximate any reasonable function arbitrarily well |Q by 
interconnecting sufficiently many architecturally identi- 
cal processing elements, the perceptrons. This replica- 
tion of identical elements invariably leads to symmetries 
in the multilayer architecture which need to be broken 
during training to achieve good generalization 0] . In the 
dynamics of gradient based training procedures the sym- 
metries give rise to suboptimal fixed points and slow con- 
vergence. In particular this holds for stochastic gradient 
descent methods which to date have been the method of 
choice for training large networks because, at each time 
step, the update of the network parameters is based on 
the presentation of only a single training example and 
hence computationally cheap. Such stochastic gradient 
descent methods have been intensively studied in the 
framework of on-line learning, where each of the train- 
ing examples is used just once by the network ||. This 
is attractive because there is no need to store the entire 
set of training examples, and the approach also works for 
nonstationary problems. However, in the on-line frame- 
work the number of required training examples is coupled 
to the slow temporal evolution caused by the symmetries 
and thus unreasonably large training sets are necessary. 
In fact the ratio of the number of examples needed for 
good generalization to the number of free parameters in 
the network diverges with the network size While 
there have been investigations into optimizing the on- 
line dynamics these have not lead to practical al- 
gorithms since the optimized procedures assume that the 
symmetry breaking provided by the initial conditions is 
macroscopic and known to the learning algorithm. 

In this letter we present a learning algorithm which has 
many of the attractive features of the traditional on-line 
procedures but yields good generalization using a much 
smaller number of training examples. Further an exact 



analysis of the algorithm's performance shows that the 
ratio of required examples to the number of free param- 
eters in the networks does stay finite in the thermody- 
namic limit. 

The multilayer architecture we analyze is a committee 
machine with K hidden units defined by 



A" 



(1) 



where £ £ M N is the input and the Bi £ Ft N are the 
unknown parameter vectors. The goal of learning is to 
estimate these parameter vectors, which we shall refer 
to as teacher vectors, from a training set of P examples 
t(£^)) of the input/output relationship. We shall ini- 
tially focus on regression problems and later sketch the 
modifications to our procedure needed for classification 
tasks. For regression, the output function g is usually 
assumed invertible, and this easily reduces to the linear 
case by applying the inverse function to the target out- 
puts. So in this case we simply assume g(x) — x. For 
brevity we also assume that the Bi are orthonormal. 

In its simplest form our procedure can be seen as gener- 
alizing Hebbian learning. There the parameter vector J p 
of a simple perceptron approximating the target function 
t(£) is obtained as the average 



J P = P 
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Our basic observation is that when r(£) is given by a 
multilayer perceptron it is important to not only consider 
the empirical mean of the distribution of t^)^ as in 
plain Hebbian learning but also its correlation matrix. 
In particular some of the eigenvectors of the correlation 
matrix correspond to the parameter vectors Bi and thus 
the supervised learning task is mapped onto the well un- 
derstood problem of principal component analysis. While 
this mapping by itself does not solve the original problem, 
it does provide a crucial reduction in its dimensionality, 
such that the remaining problem becomes almost trivial 
in the thermodynamic limit N — ► oo. 

We shall consider the general correlation matrix 
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where the simple choice F(x) — x 2 for the weight func- 
tion F yields the analogy to Hebbian learning. Assuming 
the components of the input vectors £ M to be independent 
Gaussian random variables with zero mean and unit vari- 
ance, it is straightforward to analyze the spectrum of C p 
in the limit P — > oo. The spectrum then only has three 
eigenvalues Ao, A and Aa- The degeneracy of A is N — K 
and any vector orthogonal to all teacher vectors Bi is an 
eigenvector. The eigenvalue A has the single eigenvector 
B = K^ 1 ! 2 Yli=i Bi; this eigenvector is of little interest 
since for large P one also has B cx J p , and it is thus 
simpler to use Hebb's rule (|2|) to estimate B. The im- 
portant eigenspace is the one of Aa since it is spanned 
by the K - 1 vectors B x - B j (j = 2, . . . , K). The three 
eigenvalues can be written as averages over independent, 
zero mean, unit variance, Gaussian random variables 
Ui. For instance one obtains Aa = | (F(r y )(yi — V2) 2 ) y 

where r y — K^ 1 / 2 Yli=i h{Ui)- The activation function 
h is sigmoidal (odd, monotonic and bounded), and when 
stating specific numerical values we will always assume 
h(y) = erf(y). Then for the choice F(x) = x 2 one finds 
the ordering A > Ao > Aa and Ao — Aa = j^j^ (1 — 4g). 

For a finite number P of training examples the degen- 
eracy in the spectrum is broken by random fluctuations. 
But a computation of the orthonormal eigenvectors A p 
corresponding to the K — 1 smallest eigenvalues of C p 
nevertheless yields an estimate of the space spanned by 
the difference vectors Bi — Bj. To measure the success 
of approximating this space, we introduce the overlap 

P=(K-1)- 1 ^ 2 Tt(A t BB t A) 1 / 2 , (4) 

where B is the matrix (B\, . . . , Bk) of the teacher vec- 
tors and A = (Af , . . . , A^_J. This is a sensible mea- 
sure because p is invariant with respect to the choice of 
an orthonormal basis of the space spanned by the A P 
and since it attains its maximal value of 1 iff all A P lie 
in the space spanned by the Bi. Simulations showing the 
overlap p as function of the number of examples per free 
parameter, a = P/(KN), are depicted in Fig. 1. They 
indicate a second order phase transition from zero to pos- 
itive p at a critical value a c . The evolution of p can be 
calculated exactly in the thermodynamic limit N — > oo. 
But since up to now we are only estimating the space 
spanned by the Bi, instead of the vectors themselves, we 
address this problem first and defer the calculation of p. 

Using the above eigenspace procedure, the Bi can be 
approximated by linear combinations of the A P and 
J p . Thus the original K N dimensional problem is re- 
duced to K 2 dimensions and this is much easier for 
large N. To exploit the dimensionality reduction, we 
write the possible linear combinations as Ar where A = 
( J p , Af , . . . , A€-_ 1 ), r is a K by K matrix of tunable 
parameters, and we define a student network ar in the re- 
stricted space via a r (0 = g (r-V 2 J2? =1 h ((Ar)f^)V 



We want to choose T to minimize the generalization er- 
ror e g = (^(t(£) — crr(£)) 2 )j; and to this end P additional 

examples r(^ t/ )), v = 1, . . . ,P, are used. To simplify 
the theoretical analysis, the additional examples should 
be picked independently of the examples used to obtain 
A. We then apply the standard on-line gradient descent 

procedure Y v+1 = T v - r/V r (r(i v ) ~ (rr»(i v )) ■ How- 
ever, by choosing a scaling of the learning rate r\ such 
that r] <C 1 for large N, the stochasticity drops out of 
the procedure Q and in the restricted space it performs 
gradient descent in e g . Further we can scale P such that 

r)P ^> 1 and then T p will be a minimum of e g for large 
N. Finally both scaling conditions can be satisfied while 
observing P -C N, so that the required number of addi- 
tional examples is negligible compared to the size of the 
first training set. Note that thanks to the reduced dimen- 
sionality the details of the scaling of ij and P only affect 
finite size effects and not the performance in the thermo- 
dynamic limit. Simulations combining the two stages of 
our algorithm (Fig. 1) show a steady decay in e g . 

We now turn to the theoretical calculation of p in 
the important first stage of the algorithm. The smallest 
eigenvalue of C p can be found by minimizing J T C P J un- 
der the constraint \J\ = 1. Hence we consider the parti- 
tion function Z = J <iJexp(— (3PJ T C P J) where the inte- 
gration is over the iV-dimensional unit sphere. For large 
N the typical properties of the minimization problem are 
found by calculating the training set average (In Z) and 
taking the limit [3 — * oo. Using replicas, one introduces 
the K dimensional order parameter R , the typical over- 
lap B T J of the teacher vectors with a vector J drawn 
from the Gibbs distribution Z _1 exp(— f3PJ T C p J), and 
further the replica symmetric scalar order parameter q 
which is the overlap J 1 J 2 of two vectors drawn from 
this distribution. For P > N the correlation matrix C p 
is nonsingular, q approaches 1 with increasing (3, and the 
relevant scaling is \ — — q) = 0(1). Then for the 
quenched average of In Z one finds in the limit (3 — * oo 

- extr R T A(a, X )R + o(a, x) . (5) 

Here a(a, \) — —olK (G x {r y )) y + the if by if matrix 
A(a, x) has entries 

A jk (a, X ) = -ocK (G x (T y )(y 3 y k - S jk )) y - || , (6) 

and G x (t v ) = F(t v )/(1 + 2 X F(t v )). Since Eq. (|) is 
quadratic in R the extremal problem can only have a 
solution R 7^ if the matrix A is singular. From the 
symmetries one easily obtains that A has just two eigen- 
values. The first can be written as An — An, it is the 
relevant eigenvalue in our case ^] and its eigenspace is 
spanned by the vectors e\ — ej (j = 2, . . . ,K), where 



2 



ei, . . . , e# denote the standard basis of M K . This degen- 
eracy shows that the difference between the K — 1 small- 
est eigenvalues of the correlation matrix vanishes for large 
N . So the simple procedure of analyzing the properties 
of the single vector J minimizing J T C P J, in fact yields 
the properties of the K — 1 eigenvectors vectors of Cp 
with smallest eigenvalues in the thermodynamic limit. 

Due to the degeneracy, we can reparametrize (||) set- 
ting R = p(ei — ej)/y/2 and obtain an extremal problem 
with only two variables p and \- Note that p is indeed 
the parameter introduced in the analysis of the numerical 
simulations. Its evolution is now obtained by solving (|5|) 
and this confirms the continuous phase transition found 
in the simulations from p = to positive p at a critical 
value a c . For K = 2 one finds a c = 4.49 and for K = 3 
the result is a c = 8.70. As shown in Fig. 1 beyond the 
phase transition there is excellent agreement between the 
theoretical prediction and the simulations. 

To obtain generic results for large K note that the 
contributions of y\ and 2/2 to the target output t v will be 
small in this limit. Decomposing r y as r y = t* + S y /yK 
where S y — h(yi) + ft. (2/2), and expanding G x {t v ) up to 
second order for large K simplifies Eq. [| to: 



(lnZ) 

(3N 



extr-^<G^(r;)) y <^(( yi - y2 ) 2 -2)) 



-aK (G x (r y )) y + 



1-P 2 
2X 



(7) 



On the one hand, applying the central limit theorem, the 
multiple integrals (G'^{t*)) and {G x (T y )) y can now be 
replaced by a single average, on the other hand the struc- 
ture of (0) shows that \ approaches zero with increasing 
K . These observations yield the very simple result that 
p 2 = 1 — a c /a for large K and a > a c . The value of a c 
is obtained as 



Or 



AK(F 2 {pz)) z 



{F"(pz)f z [{z 2 h 2 {z)) z - p? -2(zh(z))l)' 



(8) 



where p 2 = (ft 2 (z)) z , and the distribution of z is Gaus- 
sian with zero mean and unit variance. It is now straight- 
forward to derive the optimal choice of F from Eq. 
(§|) by noting that in the denominator {F"(pz)) z — 
p~ 2 ((z 2 — 1)F (fiz)} . Applying the Cauchy-Schwarz in- 
equality to x F 2 {pz)) z / (F"(pz)) z then yields that the 
optimal choice is F(x) = x 2 — p 2 . For this choice the 
eigenvalue Ao of the correlation matrix C p vanishes for 
large P. So the optimal F maximizes the signal to noise 
ratio between the eigenspace of difference vectors we want 
to estimate and the orthogonal space. 

For the activation function h(x) = erf(x) one finds 
that a c — 1.96-ftT when F is optimal, whereas the simple 
choice F(x) = x 2 yields a critical a which is one and a 
half times higher. Simulation results for K = 7 plotted 



in Fig. 2 show that the large K theory provides a reason- 
able approximation already for networks with quite few 
hidden units. 

To round off the analysis of the regression problem, 
we obtain a theoretical prediction for the generalization 
error achieved by combining the two stages of our pro- 
cedure. A simple calculation shows that the overlap r of 
the Hebbian vector J M with B, r = B T J M /| J^|, for large 
N satisfies r = (1 + alCB ^{V3) )-i/2_ p ur ther, using the 
explicit expression for e g given in H and the values r and 
p, we can calculate the minimum of e g in the restricted 
space and find a theoretical prediction for the generaliza- 
tion error obtained by the second stage. This prediction 
is compared to the simulations in Fig. 1 and 2. 

We next consider classification problems, that is we 
assume that the output function of the network given 
by Eq. ([!]) is g{x) — sgn(x). Then, since the out- 
put is binary, Ao = A holds for any choice of the 
weight function F, and our procedure cannot immedi- 
ately be applied. However, it is possible to gain in- 
formation about the target rule by not just consider- 
ing its output but by comparing the output to the 
value B T ^ which would have been obtained by the lin- 
earized network, g(x) — h(x) — x. This is feasible since, 
even for classification, B can be estimated by the Heb- 
bian vector J p defined by (0). We are thus lead to 
consider more general correlation matrices of the form 



A reason- 



able way of choosing F is to focus on inputs £ M where the 
target output has a different sign than the output of the 
linearized network. So we consider F(x, y) = Q(—xy)—p, 
where O is the Heavyside step function. 

In the large P limit, the matrix C p has the same 
eigenspaces as in the case of regression and the three 
eigenvalues will in general be different. For the acti- 
vation function we chose h (x) — sgn(x) to compare 
our results with the findings for on-line Gibbs learn- 
ing 1 10 1, to our knowledge the only algorithm which has 
been simulated in any detail for classifications task with 
a connected committee machine. For this h one finds 
Aa > A > A, and for large K to leading order Aa — Ao = 
\f2ir — 4/ (ir 2 K). Numerical simulations of the procedure 
are shown in Fig. 3 using p = n^ 1 arccos \/2/n. Moti- 
vated by our findings in the case of regression, this choice 
of p yields Ao = for large K. While the training sets 
needed to achieve good generalization are much larger 
than for regression, they are, for exactly the same archi- 
tecture, approximately 20-times smaller than for on-line 
Gibbs learning jn| already for N = 150. 

Throughout this Letter we have assumed that the in- 
puts are drawn from an isotropic distribution. Typ- 
ically, in practical applications, the input data itself 
will have some structure, and in this case the inputs 
have to be whitened before applying our procedure. To 
this end one computes the correlations of the inputs by 
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D p = P^ 1 53u=i £ m £ mT ■ Then instead of just consider- 
ing C p , one finds the extremal eigenvectors Aj of the 
transformed matrix R~ lT C p Rr 1 , where the square ma- 
trix R is the Cholesky factorization of D p , R T R = D p . 
Then an estimate for the space spanned by the differ- 
ence vectors is obtained by transforming back, setting 
Aj = i? _1 Aj. Numerical simulations for F(x) — x 2 show 
that whitening noticeably improves the performance of 
our algorithm even when the inputs are picked from an 
isotropic distribution. This is due to the fact that the 
spectrum of the empirical correlation matrix D p has a 
finite width even when it converges to a delta peak with 
increasing P. 

In summary, we have presented the first learning al- 
gorithm for a realistic multilayer network which can be 
exactly analyzed in the thermodynamic limit and yields 
good generalization already for a finite ratio of training 
examples to free parameters. This contrasts with the be- 
havior of traditional on-line procedures jUJ] for target 
functions such as the ones considered in this Letter. As 
long as there are on the order of N on-line steps, the dy- 
namics is exactly described by deterministic differential 
equations for a set of self-averaging order parameters ]T^ ] 
in the limit of large N. However, even when tuned by 
choosing a good learning rate, the dynamics is stuck in a 
badly generalizing plateau on this time scale unless some 
information about the target function is built into the ini- 
tial conditions. For the realistic case of randomly chosen 
initial values, the symmetries are eventually broken due 
to small initial fluctuations and the accumulation of fluc- 
tuations during the dynamics, but only when the number 
of on-line steps is much larger than N. This symmetry 
breaking is not adequately described by the deterministic 
differential equations. When the training set is sampled 
without replacement, this divergence of the time scale 
means that large numbers of examples are needed. But 
even in the recently analyzed scenario of sampling with 
replacement | fl3| , the above theoretical problems remain 
and are compounded since the analysis requires involved 
techniques such as dynamical replica theory and quite a 
few approximations already on the time scale N. 

Hence, we believe that the findings in this Letter open 
new avenues for research and that in particular the per- 
formance of the algorithm for classification tasks deserves 
a more detailed analysis. But the perhaps most intrigu- 
ing aspect of our procedure is that it does not really 
assume that the number of hidden units in the architec- 
ture is known a priori. This is highlighted by the inset of 
Fig. 2 showing that the number of hidden units can be 
determined by inspecting the eigenvalue spectrum of the 
correlation matrix. So our findings also provide a novel 
perspective on the problem of model selection. 

The work of two of us (C.B. and R.U.) was supported 
by the Deutsche Forschungsgemeinschaft. 
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FIG. 1. Results for K = 3. The evolution of p for N = 400 
(o) and N = 1600 (•). The right axis relates to the values 
of e g found when the two stages of our procedure are com- 
bined, N = 400 (□) and N = 1600 (■). For e g the value 
of a refers to the number of examples in both training sets, 
a — (P + P)/KN. The full lines show the theoretical predic- 
tion for the thermodynamic limit. Where not shown, error- 
bars are smaller than the symbol size. 
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FIG. 2. Results for K — 7 using the weight function 
F(x) = x 2 — p 2 . The numerical simulations for TV = 2000 are 
compared to the theoretical curves found in the large K limit. 
Where not shown, errorbars are smaller than the symbol size. 
The inset shows a histogram of the 200 smallest eigenvalues 
of C p for a single training set at a — 22. A gap separates 
the 6 smallest eigenvalues from the rest of the spectrum. The 
range of the shown eigenvalues is [—0.1, —0.07]. 
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FIG. 3. Numerical simulations of the classification problem 
for K = 7, N = 150 showing the values of p (o) and e g (□). 
The errorbars for the p values are smaller than the symbol 
size. Here e g is the probability of misclassifying a new input. 
Training in the restricted space during the second stage of 
our procedure uses the variant of the perceptron learning rule 
described in 11 ill for tree committee machines. 
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